This notebook walks through the general steps of a hypothesis test and then works through a few examples. For questions, concerns, or anything else relevant to this notebook, please contact me.
Here we import the libraries needed for this demonstration. Make sure you run this cell.
import numpy as np
from datascience import *
from ipywidgets import *
from utils import *
import warnings
warnings.filterwarnings("ignore")
Here is the general procedure in completing a hypothesis test:
for loop.
Now that we have a framework for how to proceed, let's look at a few examples!
If you click on the links in each of the sub-sections below, they will bring you to the respective explanation above.
Jonathan is a TA for Data 8, and notices that his students raise their left hands more often than their right hands.
He has $40$ students, and by his count, $27$ raise their left hand and $13$ raise their right hand.
Let's also define some variables to track the numbers in this observed event.
student_count = 40
observed_left = 27
observed_right = 13
Jonathan comes up with two ideas on what could be happening
He now must define a test statistic. He decides to use the Total Variation Distance (TVD), which here is computed as follows:
$$\frac{|\text{count of left hands} - \text{count of right hands}|}{2}$$
He could have chosen something else, such as:
$$|\text{count of left hands} - 20|$$
Let's define a function called tvd to help us compute this!
def tvd(left, right):
    # Half the absolute difference between the two hand counts
    return abs(left - right)/2
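For reference, TVD is usually defined on proportions rather than counts. The sketch below is not part of the original notebook: it uses a hypothetical helper named proportion_tvd to compare the observed proportions with a 50/50 split, assuming the null hypothesis is an even chance of either hand. Multiplying the result by the 40 students should recover the count-based value used here.

```python
import numpy as np

def proportion_tvd(dist_a, dist_b):
    """Total variation distance between two categorical distributions.

    Hypothetical helper for illustration; each input is an array of
    proportions that sums to 1.
    """
    dist_a = np.asarray(dist_a, dtype=float)
    dist_b = np.asarray(dist_b, dtype=float)
    return np.abs(dist_a - dist_b).sum() / 2

# Observed proportions of left and right hands among the 40 students
observed_props = np.array([27, 13]) / 40
# Assumed null distribution: each hand is equally likely
null_props = np.array([0.5, 0.5])

distance = proportion_tvd(observed_props, null_props)
scaled = distance * 40  # rescale from proportions back to counts

print(distance, scaled)
```

Because the two counts always add up to 40, the count-based formula above and the alternative $|\text{count of left hands} - 20|$ produce the same number for any outcome.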
Now let's calculate our observed value using the tvd function.
observed = tvd(observed_left, observed_right)
show(f"The observed value is: {observed}")
The observed value is: 7.0
Now we need to define a function called simulate which will simulate one event under the null hypothesis, and give us our test statistic for that event.
This function randomly chooses "left" or "right" 40 times. It then tells us how many left and right hands it picked, with left first and then right.
def simulate():
    hands = make_array("left", "right")
    # Pick "left" or "right" with equal chance, once per student
    choices = np.random.choice(hands, student_count)
    left_count = sum(choices == "left")
    right_count = sum(choices == "right")
    return make_array(left_count, right_count)
Run the cell below a few times to check out what kind of values we are getting as we simulate under the null hypothesis.
simulate()
array([23, 17])
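An alternative sketch of the same simulation, assuming each student independently raises either hand with equal chance: draw the number of left hands from a binomial distribution, and pass in a seeded generator so reruns are repeatable. The name simulate_counts and the seed value are hypothetical and are not part of this notebook.

```python
import numpy as np

def simulate_counts(rng, n_students=40, p_left=0.5):
    """Return [left_count, right_count] for one simulated class.

    Hypothetical alternative to simulate(); assumes independent students.
    """
    left = rng.binomial(n_students, p_left)
    return np.array([left, n_students - left])

rng = np.random.default_rng(8)  # arbitrary seed, chosen only for repeatability
sample = simulate_counts(rng)
print(sample)
```

The two entries always sum to the number of students, just like the output of simulate.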
Now, under the null hypothesis, let's simulate $10{,}000$ random events and collect the test statistic values for each one in an array called results.
results = make_array()
trials = 10_000
for i in np.arange(trials):
    # Simulate one class under the null and record its statistic
    one_trial = simulate()
    one_stat = tvd(one_trial.item(0), one_trial.item(1))
    results = np.append(results, one_stat)
show(f"Collected all {trials:,} statistics")
Collected all 10,000 statistics
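The loop above grows the array one value at a time, which is easy to read but copies the array on every append. As a rough sketch of a faster route, and not the approach this notebook takes, all the trials can be drawn in one vectorized call. The names rng and results_fast and the seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for repeatability
student_count = 40
trials = 10_000

# One binomial draw per trial gives the number of left hands in that trial
left_counts = rng.binomial(student_count, 0.5, size=trials)
right_counts = student_count - left_counts

# Same statistic as tvd(left, right), applied to every trial at once
results_fast = np.abs(left_counts - right_counts) / 2

print(len(results_fast), results_fast[:5])
```

The simulated statistics should follow the same distribution as those from the loop, though the individual values will differ because the random draws differ.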
Great, now that we have all the statistics, let's compare these values with our observed value.
As an intermediate step, we'll create a table with the data in it.
Check out the histogram below.
Note: Don't worry about understanding the code below, just focus on the graph!
hands = Table().with_column("Hand TVD", results)
bin_range = np.arange(max(results) + 2)  # extend the bin edges so the largest statistic is included
fig = hands.ihist(show=False, unit="hand", bins = bin_range, title = "TVD of Hands")
show("You can hover over each bar to find the exact percent!")
fig
You can hover over each bar to find the exact percent!
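One common way to make the comparison numeric is to find the fraction of simulated statistics that are at least as large as the observed one. The sketch below is an assumption about how that could be done, not code from this notebook: fraction_at_least is a hypothetical helper, and toy_stats holds made-up values for illustration rather than real simulation output.

```python
import numpy as np

def fraction_at_least(simulated_stats, observed_stat):
    """Fraction of simulated statistics that are >= the observed statistic."""
    simulated_stats = np.asarray(simulated_stats, dtype=float)
    return np.mean(simulated_stats >= observed_stat)

# Made-up statistics for illustration only (not simulation results)
toy_stats = np.array([0, 1, 1, 2, 3, 7, 8, 2, 4, 0])
toy_fraction = fraction_at_least(toy_stats, 7)  # two of the ten values are >= 7

print(toy_fraction)
```

In the notebook, you would pass results and observed in place of the toy values.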